1 🔍 Introduction

With the growing use of data in all aspects of life, it is extremely important to use information collected from consumers in an ethical manner. While there are multiple schools of ethics to draw on, the common requirement is that consumers are not harmed by the use of their data, which means taking concrete measures to secure such sensitive information.

However, with the global rise in cyber attacks and in unethical practices involving the wrongful use of data, it is no longer sufficient to simply secure such data. Steps must also be taken to prevent the dissemination of personal information and to de-identify it, for example through encryption, in the event that the information is released as open data. From the study of Floridi and Taddeo (2016), we understand that it is not a specific technology (computers, tablets, mobile phones, online platforms, cloud computing and so forth), but what any digital technology manipulates, that represents the correct focus of our ethical strategies. Furthermore, big data is ever more prevalent today because of its usefulness. However, according to a study by Kuc-Czarnecka and Olczyk (2020), large datasets coupled with complex analytical algorithms pose the risk of non-transparency, unfairness (e.g., racial or class bias), cherry-picking of data, or even intentional misleading of public opinion, including policymakers, for example by tampering with the electoral process in the context of ‘cyberwars’. For example, various well-known companies use customer data to influence a person’s buying behavior by targeting that person with specific products.

Hence, before disseminating information to the wider public, it is important to prepare the data in an ethical manner that does not enable malicious actions against the individuals or corporations who appear in the data. The following sections describe the key considerations and the steps taken while preparing the “loan performance open data” by ABC bank.

1.1 🕵 Key considerations before data preparation

The key steps to be taken before the preparation of the dataset are as follows:

  1. It is important to understand the data protection laws prevalent in the geographical region the bank is located in. As the bank “ABC” is situated in the United States of America, the open data that will be published must comply with the state and federal data protection laws in force. There is no single data privacy law in the USA; instead, there are largely sector-specific federal and state laws covering data security, secure destruction, Social Security number privacy, online privacy, biometric information privacy, and data breach notification. These laws can be referred to in greater detail here.

  2. It is important to understand the ways in which data can be misused, so that the necessary steps can be taken to prevent such misuse. Below are some of the common ways that data can be misused.

    • Commingling

    Commingling occurs when corporations or individuals capture data from a particular audience for a specific purpose but then use the same data for a separate task without citing the rightful source. Reusing data submitted for academic research for marketing purposes, or sharing client data between sister organizations without consent, are some of the most common commingling scenarios.

    • Personal benefit

    Personal data may be obtained in order to use it for an organization’s or individual’s own gain. Such use of data could also have malicious intent.

    • Ambiguity

    Ambiguity occurs when organizations fail to explicitly disclose how user data is collected and what that data will be used for in a concise and accessible manner.

  3. Once the applicable data protection laws have been studied in detail, a data ethics checklist can be used to verify that the required steps have been taken to create and redistribute the open data on loan performance. We can refer to the Deon data ethics checklist here.

  4. Based on the data ethics checklist, below are some of the important considerations to be taken into account before preparation of the dataset:

    A. Data Collection

    A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
    A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
    A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII), for example through anonymization or not collecting information that isn’t relevant for analysis?
    A.4 Downstream bias mitigation: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

    B. Data Storage

    B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
    B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?
    B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?

1.2 ⚠️ Removal of direct identifiers

Direct identifiers are variables that can pinpoint an individual in a dataset. Often, these variables consist of personal information that can be used with malicious intent if it is not removed before the data is released to the public. The following variables act as direct identifiers and are removed from the raw dataset.

  • First name ( first_name )

  • Last name ( last_name )

# Drop the direct identifiers from the raw data (assumes dplyr is loaded)
loan_data_filtered_directid <- loan_data_raw %>% select(-c('last_name', 'first_name'))
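As a sanity check on this step, the sketch below confirms on a small made-up data frame that no direct identifier column survives the filtering. The toy rows and the `direct_ids` vector are illustrative assumptions, not the bank's data, and base R is used here only so the example runs without extra packages.

```r
# Toy stand-in for loan_data_raw (all values are invented for illustration)
loan_data_raw <- data.frame(
  first_name = c("Ann", "Bob"),
  last_name  = c("Lee", "Roy"),
  loan_id    = c("L001", "L002"),
  stringsAsFactors = FALSE
)

# Direct identifiers named in this section
direct_ids <- c("first_name", "last_name")

# Keep every column that is not a direct identifier
loan_data_filtered_directid <-
  loan_data_raw[, !(names(loan_data_raw) %in% direct_ids), drop = FALSE]

# Direct identifiers still present after filtering (should be none)
leftover <- intersect(names(loan_data_filtered_directid), direct_ids)
stopifnot(length(leftover) == 0)
```

Running a check like this right after the removal step makes it harder for a renamed or re-added identifier column to slip into the published file unnoticed.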

1.3 🛣 Handling quasi-identifiers

Quasi-identifiers are pieces of information that cannot identify an individual on their own but can be combined with one another to reasonably identify an individual.
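To illustrate why combinations matter, the sketch below counts how many records share each combination of quasi-identifier values in a small invented data frame. The column names follow this report's variables, but the rows, the `quasi_ids` vector and the choice of a group-size count as the risk measure are assumptions made for illustration only.

```r
# Toy records: no names, but state + zip_3 + seller act as quasi-identifiers
loans <- data.frame(
  state  = c("CA", "CA", "CA", "NY"),
  zip_3  = c("900", "900", "941", "100"),
  seller = c("A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

quasi_ids <- c("state", "zip_3", "seller")

# Count how many records share each combination of quasi-identifier values
group_sizes <- aggregate(
  list(n = rep(1L, nrow(loans))),
  by  = loans[quasi_ids],
  FUN = sum
)

# Combinations held by a single record are the easiest to re-identify
unique_combos <- group_sizes[group_sizes$n == 1, ]
print(unique_combos)
```

Any combination that appears only once points to exactly one record, so someone who already knows those values about a person could single that person out even though no name is present.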

Several techniques are used to suppress this sensitive information, as described below:

1.3.1 Removal of variables

Certain quasi-identifiers will be removed from the final dataset, as this set of variables may not contribute much to assessing loan performance. These variables include the unique loan identifier and geographical data. The following variables will be dropped to reduce the risk of re-identification:

  • Loan identifier ( loan_id )
  • Seller name ( seller )
  • Property state ( state )
  • Zip code short ( zip_3 )
  • Servicer name ( servicer )
  • Metropolitan Statistical Area ( msa )
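The removal listed above can be sketched as follows. This is a minimal example on invented rows, assuming the column names given in the list; the `orig_amt` column and the output name `loan_data_filtered_quasi` are hypothetical and introduced only for illustration.

```r
# Toy stand-in for the data after direct identifiers were removed
loan_data_filtered_directid <- data.frame(
  loan_id  = c("L001", "L002"),
  seller   = c("A", "B"),
  state    = c("CA", "NY"),
  zip_3    = c("900", "100"),
  servicer = c("S1", "S2"),
  msa      = c("31080", "35620"),
  orig_amt = c(250000, 180000),  # hypothetical analysis column
  stringsAsFactors = FALSE
)

# Quasi-identifiers listed in this section
drop_vars <- c("loan_id", "seller", "state", "zip_3", "servicer", "msa")

# Warn if an expected column is absent, e.g. because it was renamed upstream
missing_vars <- setdiff(drop_vars, names(loan_data_filtered_directid))
if (length(missing_vars) > 0) {
  warning("Columns not found: ", paste(missing_vars, collapse = ", "))
}

# Keep only the columns that are not in the drop list
keep <- !(names(loan_data_filtered_directid) %in% drop_vars)
loan_data_filtered_quasi <- loan_data_filtered_directid[, keep, drop = FALSE]
```

With dplyr loaded, as elsewhere in this report, the same step could be written as `select(-any_of(drop_vars))`. Base R is used in the sketch only to keep it free of package dependencies.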

1.3.2

2 Reference


Floridi, Luciano, and Mariarosaria Taddeo. 2016. “What Is Data Ethics?” Philosophical Transactions A of the Royal Society 374 (November). https://doi.org/10.1098/rsta.2016.0112.
Kuc-Czarnecka, Marta, and Magdalena Olczyk. 2020. “How Ethics Combine with Big Data: A Bibliometric Analysis.” Humanities and Social Sciences Communications 7 (1): 1–9.